6 research outputs found

    Distributed query-aware quantization for high-dimensional similarity searches

    Get PDF
    The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest Neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic. There is potential to improve accuracy when a query-dependent quantization is used. In this paper we propose a Query dependent Equi-Depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions

    Faster Multidimensional Data Queries on Infrastructure Monitoring Systems

    Get PDF
    The analytics in online performance monitoring systems have often been limited due to the query performance of large scale multidimensional data. In this paper, we introduce a faster query approach using the bit-sliced index (BSI). Our study covers multidimensional grouping and preference top-k queries with the BSI, algorithms design, time complexity evaluation, and the query time comparison on a real-time production performance monitoring system. Our research work extended the BSI algorithms to cover attributes filtering and multidimensional grouping. We evaluated the query time with the single attribute, multiple attributes, feature filtering, and multidimensional grouping. To compare with the existing prior arts, we made a benchmarking comparison with the bitmap indexing, sequential scan, and collection streaming grouping. In the result of our experiments with large scale production data, the proposed BSI approach outperforms the existing prior arts: 3 times faster than the bitmap indexing approach on single attribute top-k queries, 10 times faster than the collection stream approach on the multidimensional grouping. While comparing with the baseline sequential scan approach, our proposed algorithm BSI approach outperforms the sequential scan approach with a factor of 10 on multiple attributes queries and a factor of 100 on single attribute queries. In the previous research, we had evaluated the BSI time complexity and space complexity on simulation data with various distributions, this research work further studied, evaluated, and concluded the BSI approach query performance with real production data

    Hybrid query optimization for hard-to-compress bit-vectors

    Get PDF
    Bit-vectors are widely used for indexing and summarizing data due to their efficient processing in modern computers. Sparse bit-vectors can be further compressed to reduce their space requirement. Special compression schemes based on run-length encoders have been designed to avoid explicit decompression and minimize the decoding overhead during query execution. Moreover, highly compressed bit-vectors can exhibit a faster query time than the non-compressed ones. However, for hard-to-compress bit-vectors, compression does not speed up queries and can add considerable overhead. In these cases, bit-vectors are often stored verbatim (non-compressed). On the other hand, queries are answered by executing a cascade of bit-wise operations involving indexed bit-vectors and intermediate results. Often, even when the original bit-vectors are hard to compress, the intermediate results become sparse. It could be feasible to improve query performance by compressing these bit-vectors as the query is executed. In this scenario, it would be necessary to operate verbatim and compressed bit-vectors together. In this paper, we propose a hybrid framework where compressed and verbatim bitmaps can coexist and design algorithms to execute queries under this hybrid model. Our query optimizer is able to decide at run time when to compress or decompress a bit-vector. Our heuristics show that the applications using higher-density bitmaps can benefit from using this hybrid model, improving both their query time and memory utilization

    Performance evaluation of word-aligned compression methods for bitmap indices

    Get PDF
    Bitmap indices are a widely used scheme for large read-only repositories in data warehouses and scientific databases. This binary representation allows the use of bit-wise operations for fast query processing and is typically compressed using run-length encoding techniques. Most bitmap compression techniques are aligned using a fixed encoding length (32 or 64 bits) to avoid explicit decompression during query time. They have been proposed to extend or enhance word-aligned hybrid (WAH) compression. This paper presents a comparative study of four bitmap compression techniques: WAH, PLWAH, CONCISE, and EWAH. Experiments are targeted to identify the conditions under which each method should be applied and quantify the overhead incurred during query processing. Performance in terms of compression ratio and query time is evaluated over synthetic-generated bitmap indices, and results are validated over bitmap indices generated from real data sets. Different query optimizations are explored, query time estimation formulas are defined, and the conditions under which one method should be preferred over another are formalized

    CLP: A Platform for Competitive Learning

    Get PDF
    We introduce the Competitive Learning Platform (CLP), an online continuous improvement tool that provides automatic partial performance feedback to students or groups of students on individual or collaborative assignments. CLP motivates students to do their best and come up with new solutions that can lead to improved assignment results before the assignment deadline. In this work, we describe the CLP system and present the results of a comprehensive set of analyses aimed at gauging the impact of utilizing this platform on student motivation, engagement, and performance. The analyses are based on a rich dataset containing CLP submission, student outcome, and student feedback data obtained from a variety of undergraduate and graduate classes using the tool at two universities over a period of five years. The sample includes 18 courses, 606 students, and 15782 CLP submissions. Results indicate that CLP is beneficial in this setting, leading to active student participation and improved motivation

    A tunable compression framework for bitmap indices

    No full text
    Abstract—Bitmap indices are widely used for large read-only repositories in data warehouses and scientific databases. Their binary representation allows for the use of bitwise operations and specialized run-length compression techniques. Due to a trade-off between compression and query efficiency, bitmap compression schemes are aligned using a fixed encoding length size (typically the word length) to avoid explicit decompression during query time. In general, smaller encoding lengths provide better compression, but require more decoding during query execution. However, when the difference in size is considerable, it is possible for smaller encodings to also provide better execution time. We posit that a tailored encoding length for each bit vector will provide better performance than a one-size-fits-all approach. We present a framework that optimizes compression and query efficiency by allowing bitmaps to be compressed using variable encoding lengths while still maintaining alignment to avoid explicit decompression. Efficient algorithms are introduced to process queries over bitmaps compressed using different encoding lengths. An input parameter controls the aggressiveness of the compression providing the user with the ability to tune the trade-off between space and query time. Our empirical study shows this approach achieves significant improvements in terms of both query time and compression ratio for synthetic and real data sets. Compared to 32-bit WAH, VAL-WAH produces up to 1.8× smaller bitmaps and achieves query times that are 30 % faster. I
    corecore